Introduction

In this R Notebooks, I will explore the distribution of Zipcodes accross DMAs. When we map the DMAs based on the zipcodes, we get that a given DMA is widespread accross the US as shown in the map below. This is noise in the data. But what percentage is noise?

dma_zip

Hypothesis

For a given zipcode, the DMA that it most frequently is associated with is the correct zipcode-DMA association. All the rest are outliers. Therefore, we will calculate the percentage of occurance of a given zipcode accross different DMAs to detect the outliers.


Group by geo_zip and geo_dma

The table below shows sample of records of geo_zip and geo_dma with their count. These are the most frequent ones.


Group by zipcode

The table below shows the total instances by zipcode. It will help us to analyze what percentage of records of a given zipcode is found in a given DMA.


Calculate percentage

Now, let’s calculate percentage of a given zipcode records across a DMA.


Distribution of the percentages

Let’s see the distibution of the percentages.

We see two peaks near 0 and 100. Let’s use log scale to further understand the distribution.


Remarks:


How many zipcodes have maximum percentage more than 95%?

[1] 32793


How many unique zipcodes do we have in the data?

[1] 33089

We see that 296 zipcodes (33089-32793), which is about 0.9%, have less than 95% maximum percentage of occrance in a given DMA.


Let’s see their distibution.


How many records are outliers

There are  1123285 outlier zipcodes out of a total of 451653467 records
This is 0.2487051 percent


Correct the incoccert ZIP Code to DMA pairs

correct_zip_dma = df %>% anti_join(outliers, by = c("geo_dma", "geo_zip")) %>% 
                 select(geo_dma, geo_zip) %>% distinct()
in_correct_zip_dm = outliers %>% select(geo_dma, geo_zip) %>% distinct()

Final Clean Data

final_df = df %>% select(-geo_dma) %>% left_join(correct_zip_dma, by = "geo_zip")

write.csv(final_df, "clean_geo_zip_dma_mapping.csv", row.names = FALSE)

Summary

So, there are less than 0.25% of zipcodes outside of their DMA (outliers) assuming that those with maximum percentages for each geo-zip and geo-dma combination are correct. After replacing the incorrect ones with the correct zipcode-DMA mapping, we get the map below. dma_zip_clean